$$\frac{\partial L_{Adv_p}}{\partial C_p^l} = -\sum_i 2\left(1 - D_p(T_{p,i}^l; Y_p)\right)\frac{\partial D_p}{\partial C_p^l}. \tag{3.95}$$
Furthermore,
$$\frac{\partial L_{Data_p}}{\partial C_p^l} = \frac{1}{n}\sum_i \left(R_p - T_p\right)\frac{\partial T_p}{\partial C_p^l}. \tag{3.96}$$
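In practice these gradients are produced by automatic differentiation, but their summation structure can be made explicit. The NumPy sketch below evaluates Eqs. (3.95) and (3.96) for toy per-sample discriminator outputs and pre-computed Jacobians; the array names, shapes, and random stand-in values are illustrative assumptions, not part of the original formulation.

```python
# Toy NumPy evaluation of Eqs. (3.95)-(3.96). D_out, the Jacobians dD_dC and
# dT_dC, and the feature maps R, T are random stand-ins; only the summation
# structure mirrors the equations.
import numpy as np

def grad_adv_wrt_C(D_out, dD_dC):
    """Eq. (3.95): -sum_i 2 * (1 - D_p(T^l_{p,i}; Y_p)) * dD_p/dC^l_p."""
    return -np.sum(2.0 * (1.0 - D_out)[:, None] * dD_dC, axis=0)

def grad_data_wrt_C(R, T, dT_dC, n):
    """Eq. (3.96): (1/n) * sum_i (R_p - T_p) * dT_p/dC^l_p."""
    return np.sum((R - T)[:, None] * dT_dC, axis=0) / n

rng = np.random.default_rng(0)
n, c = 4, 3                          # 4 samples, 3 entries in C^l_p
D_out = rng.uniform(size=n)          # discriminator outputs on T^l_{p,i}
dD_dC = rng.normal(size=(n, c))      # per-sample rows of dD_p/dC^l_p
R, T = rng.normal(size=n), rng.normal(size=n)
dT_dC = rng.normal(size=(n, c))      # per-sample rows of dT_p/dC^l_p
print(grad_adv_wrt_C(D_out, dD_dC))
print(grad_data_wrt_C(R, T, dT_dC, n))
```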
The complete training process is summarized in Algorithm 4, including the update of the
discriminators.
Algorithm 4 Pruned RBCN
Input: The training dataset, the pre-trained 1-bit CNN model, the feature maps R_p from the pre-trained model, the pruning rate, and the hyper-parameters, including the initial learning rate, weight decay, convolution stride, and padding size.
Output: The pruned RBCN with updated parameters W_p, Ŵ_p, M_p, and C_p.
1: repeat
2:   Randomly sample a mini-batch;
3:   // Forward propagation
4:   Train the pruned architecture // Using Eqs. 17–22
5:   for l = 1 to L convolutional layers do
6:     F^l_{out,p} = Conv(F^l_{in,p}, (Ŵ^l_p ◦ M_p) ⊙ C^l_p);
7:   end for
8:   // Backward propagation
9:   for l = L to 1 do
10:    Update the discriminators D^l_p(·) by ascending their stochastic gradients:
11:    ∇_{D^l_p}( log(D^l_p(R^l_p; Y_p)) + log(1 − D^l_p(T^l_p; Y_p)) + log(D^l_p(T_p; Y_p)) );
12:    Update the soft mask M_p by FISTA // Using Eqs. 24–26
13:    Calculate the gradient δW^l_p; // Using Eqs. 27–31
14:    W^l_p ← W^l_p − η_{p,1} δW^l_p; // Update the weights
15:    Calculate the gradient δC^l_p; // Using Eqs. 32–36
16:    C^l_p ← C^l_p − η_{p,2} δC^l_p; // Update the learnable matrices
17:  end for
18: until the maximum epoch
19: Ŵ = sign(W).
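A compact PyTorch-style rendering of this loop is sketched below. It is not the authors' implementation: the helpers weight_params(), matrix_params(), features_and_logits(), and fista_update() are hypothetical stand-ins for the W_p/C_p parameter groups, the layer-wise feature maps T^l_p, and the FISTA soft-mask step of Eqs. 24–26; conditioning the discriminators on Y_p is omitted for brevity.

```python
# A minimal sketch of the Algorithm 4 loop, under the assumptions named above.
# `teacher` is the pre-trained 1-bit model providing R^l_p; `generator` is the
# pruned RBCN whose layers compute Conv(F_in, (W_hat o M) * C).
import torch
import torch.nn.functional as F

def train_pruned_rbcn(generator, discriminators, teacher, loader,
                      max_epoch, lr_w=1e-3, lr_c=1e-3, lr_d=1e-4):
    opt_w = torch.optim.SGD(generator.weight_params(), lr=lr_w)   # W_p (step 14)
    opt_c = torch.optim.SGD(generator.matrix_params(), lr=lr_c)   # C_p (step 16)
    opt_d = torch.optim.Adam(
        [p for d in discriminators for p in d.parameters()], lr=lr_d)

    for epoch in range(max_epoch):
        for x, _ in loader:                        # step 2: sample a mini-batch
            with torch.no_grad():
                feats_r = teacher.features(x)      # R^l_p from the pre-trained model
            feats_t, _ = generator.features_and_logits(x)   # T^l_p (steps 5-7)

            # Steps 10-11: ascend the discriminators' stochastic gradients
            # (implemented here as descending the negated objective).
            d_loss = 0.0
            for rl, tl, d in zip(feats_r, feats_t, discriminators):
                d_loss = d_loss - (torch.log(d(rl)).mean()
                                   + torch.log(1.0 - d(tl.detach())).mean())
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()

            # Step 12: soft-mask update by FISTA (Eqs. 24-26, left abstract here).
            generator.fista_update()

            # Steps 13-16: adversarial + data losses drive W_p and C_p.
            g_loss = 0.0
            for tl, d in zip(feats_t, discriminators):
                g_loss = g_loss + ((1.0 - d(tl)) ** 2).mean()        # L_Adv term
            g_loss = g_loss + F.mse_loss(feats_t[-1], feats_r[-1])   # L_Data (final maps)
            opt_w.zero_grad(); opt_c.zero_grad()
            g_loss.backward()
            opt_w.step(); opt_c.step()
    return generator
```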
3.6.4 Ablation Study
This section studies the performance contributions of the kernel approximation, the GAN,
and the update strategy (we fix the parameters of the convolutional layers and update the
other layers). CIFAR100 and ResNet18 with different kernel stages are used.
1) We replace the convolution in Bi-Real Net with our kernel approximation (RBConv) and compare the results; a minimal RBConv sketch is given below, after point 2. As shown in the columns "Bi" and "R" of Table 3.3, RBCN achieves a 1.62% accuracy improvement over Bi-Real Net (56.54% vs. 54.92%) with the same ResNet18 network structure. This significant improvement verifies the effectiveness of the learnable matrices.
2) Using the GAN improves RBCN by 2.59% (59.13% vs. 56.54%) with the kernel stage 32-32-64-128, which shows that the GAN helps mitigate the problem of being trapped in poor local minima.
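To make the RBConv replacement in point 1 concrete, the sketch below binarizes the weights with a straight-through sign function and modulates them with a learnable matrix before the convolution, i.e. the (Ŵ ◦ M) ⊙ C form of Algorithm 4 with the soft mask dropped. The channel-wise shape chosen for C and all layer hyper-parameters are assumptions for illustration, not the exact configuration used in the experiments.

```python
# A minimal sketch of the RBConv idea: sign-binarized weights scaled by a
# learnable matrix C (here one scalar per output channel, an assumption).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """Binarize in the forward pass; pass gradients through for |w| <= 1."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).to(grad_out.dtype)

class RBConv(nn.Module):
    def __init__(self, c_in, c_out, k=3, stride=1, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(c_out, c_in, k, k) * 0.01)
        self.C = nn.Parameter(torch.ones(c_out, 1, 1, 1))   # learnable matrix C^l
        self.stride, self.padding = stride, padding

    def forward(self, x):
        w_hat = SignSTE.apply(self.weight)                  # binarized weights
        return F.conv2d(x, w_hat * self.C,                  # modulate W_hat by C
                        stride=self.stride, padding=self.padding)

# usage: y = RBConv(64, 128, stride=2)(torch.randn(1, 64, 32, 32))
```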